Load Data

library(ggplot2)
data("diamonds")

Scatterplot Review

ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(fill = "orange", color = "black", shape = 21) +
  coord_cartesian(xlim = c(0, quantile(diamonds$carat, 0.99)),
  ylim = c(0, quantile(diamonds$price, 0.99)))


Price and Carat Relationship

Seems like an exponential relationship. The variance also increases as carat size increases.

# A linear fit model doesn't predict the relationship.
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(fill = "orange", color = "black", shape = 21) +
  stat_smooth(method = "lm") +
  coord_cartesian(xlim = c(0, quantile(diamonds$carat, 0.99)),
  ylim = c(0, quantile(diamonds$price, 0.99)))


ggpairs Function

# install these if necessary
# install.packages('GGally')
# install.packages('scales')
# install.packages('memisc')
# install.packages('lattice')
# install.packages('MASS')
# install.packages('car')
# install.packages('reshape')
# install.packages('plyr')

# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
## 
##     percent
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
## 
##     as.array
library(gridExtra)

# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, 
        lower = list(continuous = wrap("points", shape = I('.'))), 
        upper = list(combo = wrap("box", outlier.shape = I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What are some things you notice in the ggpairs output? * Might be a relationship between price-color and price-clarity * Strong relationship between price-carat * Weight = f(volume) = f(xyz) ==> Interested in cubic root of carat


The Demand of Diamonds

Notes: https://www.r-statistics.com/2013/05/log-transformations-for-skewed-and-wide-distributions-from-practical-data-science-with-r/

p1 = ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 100, fill = "cyan", color = "white")
p2 = ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 0.01, fill = "orange", color = "white") +
  scale_x_log10()
grid.arrange(p1, p2)


Connecting Demand and Price Distributions


Scatterplot Transformation

Notes: The prices now look less dispersed with log10 price but we can do better transforming the carat value too.

ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(fill = "orange", color = "black", shape = 21) +
  scale_y_log10()

Create a new function to transform the carat variable

cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)

Use the cuberoot_trans function

Notes: With the transformation of carat and price, the relationship looks linear.

ggplot(aes(carat, price), data = diamonds) + 
  geom_point() + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1683 rows containing missing values (geom_point).


Overplotting Revisited

head(sort(table(diamonds$carat), decreasing = TRUE))
## 
##  0.3 0.31 1.01  0.7 0.32    1 
## 2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price), decreasing = TRUE))
## 
## 605 802 625 828 776 698 
## 132 127 126 125 124 121
ggplot(aes(carat, price), data = diamonds) + 
  geom_point(alpha = .5, size = 3/4, position = "jitter") + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).


Other Qualitative Factors

Notes: Clarity, Cut and Color


Price vs. Carat and Clarity

Alter the code below.

# install and load the RColorBrewer package
# install.packages('RColorBrewer')
library(RColorBrewer)

ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Clarity', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1693 rows containing missing values (geom_point).


Clarity and Price


Price vs. Carat and Cut

Alter the code below.

ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Cut', reverse = T,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Cut')
## Warning: Removed 1696 rows containing missing values (geom_point).


Cut and Price


Price vs. Carat and Color

Alter the code below.

ggplot(aes(x = carat, y = price, color = color), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Color', reverse = F,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1688 rows containing missing values (geom_point).


Color and Price


Linear Models in R

Notes: Price is the outcome and carat is the predictor variable. We used our domain knowledge of diamonds and carat weight to take the cube root of carat weight (volume).


Building the Linear Model

Notes: http://data.princeton.edu/R/linearModels.html

m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5, sdigits = 3)
## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamonds)
## 
## =============================================================================
##                       m1          m2          m3         m4          m5      
## -----------------------------------------------------------------------------
##   (Intercept)      2.821***    1.039***    0.874***    0.932***   0.415***   
##                   (0.006)     (0.019)     (0.019)     (0.017)    (0.010)     
##   I(carat^(1/3))   5.558***    8.568***    8.703***    8.438***   9.144***   
##                   (0.007)     (0.032)     (0.031)     (0.028)    (0.016)     
##   carat                       -1.137***   -1.163***   -0.992***  -1.093***   
##                               (0.012)     (0.011)     (0.010)    (0.006)     
##   cut: .L                                  0.224***    0.224***   0.120***   
##                                           (0.004)     (0.004)    (0.002)     
##   cut: .Q                                 -0.062***   -0.062***  -0.031***   
##                                           (0.004)     (0.003)    (0.002)     
##   cut: .C                                  0.051***    0.052***   0.014***   
##                                           (0.003)     (0.003)    (0.002)     
##   cut: ^4                                  0.018***    0.018***  -0.002      
##                                           (0.003)     (0.002)    (0.001)     
##   color: .L                                           -0.373***  -0.441***   
##                                                       (0.003)    (0.002)     
##   color: .Q                                           -0.129***  -0.093***   
##                                                       (0.003)    (0.002)     
##   color: .C                                            0.001     -0.013***   
##                                                       (0.003)    (0.002)     
##   color: ^4                                            0.029***   0.012***   
##                                                       (0.003)    (0.002)     
##   color: ^5                                           -0.016***  -0.003*     
##                                                       (0.003)    (0.001)     
##   color: ^6                                           -0.023***   0.001      
##                                                       (0.002)    (0.001)     
##   clarity: .L                                                     0.907***   
##                                                                  (0.003)     
##   clarity: .Q                                                    -0.240***   
##                                                                  (0.003)     
##   clarity: .C                                                     0.131***   
##                                                                  (0.003)     
##   clarity: ^4                                                    -0.063***   
##                                                                  (0.002)     
##   clarity: ^5                                                     0.026***   
##                                                                  (0.002)     
##   clarity: ^6                                                    -0.002      
##                                                                  (0.002)     
##   clarity: ^7                                                     0.032***   
##                                                                  (0.001)     
## -----------------------------------------------------------------------------
##   R-squared            0.924       0.935       0.939      0.951       0.984  
##   adj. R-squared       0.924       0.935       0.939      0.951       0.984  
##   sigma                0.280       0.259       0.250      0.224       0.129  
##   F               652012.063  387489.366  138654.523  87959.467  173791.084  
##   p                    0.000       0.000       0.000      0.000       0.000  
##   Log-likelihood   -7962.499   -3631.319   -1837.416   4235.240   34091.272  
##   Deviance          4242.831    3613.360    3380.837   2699.212     892.214  
##   AIC              15930.999    7270.637    3690.832  -8442.481  -68140.544  
##   BIC              15957.685    7306.220    3761.997  -8317.942  -67953.736  
##   N                53940       53940       53940      53940       53940      
## =============================================================================

Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.


Model Problems

Data is from 2008, so we need to account for: * Inflation * 2008 global financial crisis * Diamond Market in China is growing rapidly * Uneven price increase across diferrent carat weight


A Bigger, Better Data Set

# install.package('bitops')
# install.packages('RCurl')
# library('bitops')
# library('RCurl')
# 
# diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
# load(rawConnection(diamondsurl))
load("BigDiamonds.Rda")

The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data

Building a Model Using the Big Diamonds Data Set

diamondsbig$logprice = log(diamondsbig$price)
m1 <- lm(logprice ~ I(carat^(1/3)),
         data = diamondsbig[diamondsbig$price < 10000 &
                              diamondsbig$cert == "GIA",])
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5, sdigits = 3)
## 
## Calls:
## m1: lm(formula = logprice ~ I(carat^(1/3)), data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m2: lm(formula = logprice ~ I(carat^(1/3)) + carat, data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m3: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut, data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m4: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == 
##         "GIA", ])
## m5: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == 
##     "GIA", ])
## 
## =================================================================================
##                       m1           m2           m3          m4          m5       
## ---------------------------------------------------------------------------------
##   (Intercept)       2.671***     1.333***    0.949***    0.529***    -0.464***   
##                    (0.003)      (0.012)     (0.012)     (0.010)      (0.009)     
##   I(carat^(1/3))    5.839***     8.243***    8.633***    8.110***     8.320***   
##                    (0.004)      (0.022)     (0.021)     (0.017)      (0.012)     
##   carat                         -1.061***   -1.223***   -0.782***    -0.763***   
##                                 (0.009)     (0.009)     (0.007)      (0.005)     
##   cut: V.Good                                0.120***    0.090***     0.071***   
##                                             (0.002)     (0.001)      (0.001)     
##   cut: Ideal                                 0.211***    0.181***     0.131***   
##                                             (0.002)     (0.001)      (0.001)     
##   color: K/L                                             0.123***     0.117***   
##                                                         (0.004)      (0.003)     
##   color: J/L                                             0.312***     0.318***   
##                                                         (0.003)      (0.002)     
##   color: I/L                                             0.451***     0.469***   
##                                                         (0.003)      (0.002)     
##   color: H/L                                             0.569***     0.602***   
##                                                         (0.003)      (0.002)     
##   color: G/L                                             0.633***     0.665***   
##                                                         (0.003)      (0.002)     
##   color: F/L                                             0.687***     0.723***   
##                                                         (0.003)      (0.002)     
##   color: E/L                                             0.729***     0.756***   
##                                                         (0.003)      (0.002)     
##   color: D/L                                             0.812***     0.827***   
##                                                         (0.003)      (0.002)     
##   clarity: I1                                                         0.301***   
##                                                                      (0.006)     
##   clarity: SI2                                                        0.607***   
##                                                                      (0.006)     
##   clarity: SI1                                                        0.727***   
##                                                                      (0.006)     
##   clarity: VS2                                                        0.836***   
##                                                                      (0.006)     
##   clarity: VS1                                                        0.891***   
##                                                                      (0.006)     
##   clarity: VVS2                                                       0.935***   
##                                                                      (0.006)     
##   clarity: VVS1                                                       0.995***   
##                                                                      (0.006)     
##   clarity: IF                                                         1.052***   
##                                                                      (0.006)     
## ---------------------------------------------------------------------------------
##   R-squared             0.888        0.892       0.899       0.937        0.969  
##   adj. R-squared        0.888        0.892       0.899       0.937        0.969  
##   sigma                 0.289        0.284       0.275       0.216        0.154  
##   F               2700903.714  1406538.330  754405.425  423311.488   521161.443  
##   p                     0.000        0.000       0.000       0.000        0.000  
##   Log-likelihood   -60137.791   -53996.269  -43339.818   37830.414   154124.270  
##   Deviance          28298.689    27291.534   25628.285   15874.910     7992.720  
##   AIC              120281.582   108000.539   86691.636  -75632.827  -308204.540  
##   BIC              120313.783   108043.473   86756.037  -75482.557  -307968.400  
##   N                338946       338946      338946      338946       338946      
## =================================================================================

Predictions

Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601

#Be sure you???ve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
                         color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
                        interval="prediction", level = .95)
# We need to exponentiate the result of our model since we took the log of price.
exp(modelEstimate)
##        fit     lwr      upr
## 1 5040.436 3730.34 6810.638

Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.